CS50 Introduction to Artificial Intelligence with Python Lecture 7
The topic of Lecture 7 is Language. This post summarizes the lecture and the seventh project.
Course link: https://cs50.harvard.edu/ai/
Note: all images are from the course slides.
Natural Language Processing
Context-Free Grammar
A context-free grammar specifies the syntactic rules of a language, for example:
NP → N | D N
N → she | city | car | Harry | …
D → the | a | an | …
V → saw | ate | walked | …
P → to | on | over | …
ADJ → blue | busy | old | …
Applying a context-free grammar to a sentence produces a parse tree; see the nltk sketch below.
nltk
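The course uses the nltk library for parsing and tokenization. As a minimal sketch, parsing with a context-free grammar might look like this (the S and VP rules here are assumed for completeness, since the rule list above only shows the others):

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | D N
    VP -> V | V NP
    N -> "she" | "city" | "car" | "Harry"
    D -> "the" | "a" | "an"
    V -> "saw" | "ate" | "walked"
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("she saw the city".split()):
    tree.pretty_print()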
$n$-gram
A contiguous sequence of $n$ items from a text, e.g., $n$ consecutive words or characters.
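For example, a quick sketch of word bigrams with nltk (the sample sentence is made up):

from nltk.util import ngrams

words = "she saw the city".split()
print(list(ngrams(words, 2)))
# [('she', 'saw'), ('saw', 'the'), ('the', 'city')]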
tokenization
The task of splitting a sequence of characters into pieces (tokens); common granularities are word tokens and sentence tokens.
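A small sketch using nltk's tokenizers (assumes the punkt tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

import nltk

print(nltk.word_tokenize("Harry saw the city."))
# ['Harry', 'saw', 'the', 'city', '.']
print(nltk.sent_tokenize("Harry saw the city. She walked away."))
# ['Harry saw the city.', 'She walked away.']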
bag-of-words model
A model that represents text as an unordered collection of its words.
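In code, a bag of words can be as simple as a word-count dictionary:

from collections import Counter

print(Counter("the city saw the car".split()))
# Counter({'the': 2, 'city': 1, 'saw': 1, 'car': 1})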
Naive Bayes
Consider a text-classification problem, e.g., classifying a review as expressing positive or negative sentiment.
Naive Bayes assumes conditional independence of the words given the class, i.e.

$$P(\text{class} \mid w_1, \ldots, w_n) \propto P(\text{class}) \prod_{i=1}^{n} P(w_i \mid \text{class})$$

where each factor is estimated from counts in the training data:

$$P(\text{class}) = \frac{\#\{\text{documents with this class}\}}{\#\{\text{documents}\}}, \qquad P(w_i \mid \text{class}) = \frac{\#\{\text{documents of this class containing } w_i\}}{\#\{\text{documents of this class}\}}$$
additive smoothing (Laplace smoothing)
Note that the count-based estimates above produce many zeros (a single unseen word zeroes out the entire product), so the estimates need to be smoothed. A common approach is to add $\alpha$ to every count; $\alpha = 1$ gives Laplace smoothing.
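A minimal Naive Bayes sketch with additive smoothing; the toy training data is made up, and this sketch smooths per-word counts (a multinomial variant) rather than the document proportions above:

import math
from collections import Counter

train = [
    ("loved it great fun", "positive"),
    ("great product", "positive"),
    ("was a disaster", "negative"),
    ("not fun at all", "negative"),
]

alpha = 1  # alpha = 1 is Laplace smoothing
priors = Counter(label for _, label in train)
counts = {label: Counter() for label in priors}
for text, label in train:
    counts[label].update(text.split())
vocab = {word for counter in counts.values() for word in counter}

def classify(text):
    best, best_score = None, float("-inf")
    for label in priors:
        # log P(class) + sum over words of log P(word | class), smoothed
        score = math.log(priors[label] / len(train))
        total = sum(counts[label].values())
        for word in text.split():
            score += math.log(
                (counts[label][word] + alpha) / (total + alpha * len(vocab))
            )
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("great fun"))  # positive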
information retrieval
Information retrieval is the task of finding relevant documents in response to a user query.
topic modeling
Topic modeling refers to models for discovering the topics of a collection of documents.
term frequency
Term frequency is the number of times a term appears in a document.
function words
Words that carry little meaning on their own but connect other words grammatically, e.g., am, by, do, is, which, with, yet, …
content words
Words that carry meaning independently, e.g., algorithm, category, computer, …
inverse document frequency
Inverse document frequency measures how common or rare a word is across a collection of documents. It is computed as

$$\mathrm{idf}(w) = \log \frac{N}{n_w}$$

where $N$ is the number of documents and $n_w$ is the number of documents containing $w$ (the natural logarithm is used in the code below).
tf-idf
tf-idf ranks the important words in a document by multiplying term frequency (TF) by inverse document frequency (IDF).
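A minimal tf-idf sketch following the idf formula above; the toy documents are made up:

import math
from collections import Counter

docs = {
    "a.txt": "the car saw the city".split(),
    "b.txt": "the city was busy".split(),
}

def idf(word):
    containing = sum(word in words for words in docs.values())
    return math.log(len(docs) / containing)

def tf_idf(word, filename):
    # term frequency * inverse document frequency
    return Counter(docs[filename])[word] * idf(word)

print(tf_idf("car", "a.txt"))  # tf = 1, idf = log(2/1) > 0
print(tf_idf("the", "a.txt"))  # tf = 2, idf = log(2/2) = 0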
Word Representation
one-hot representation
- he [1, 0, 0, 0]
- wrote [0, 1, 0, 0]
- a [0, 0, 1, 0]
- book [0, 0, 0, 1]
One-hot representations cannot capture relationships between words, and they do not scale: adding a single new word requires re-encoding the entire vocabulary.
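A one-hot encoding is straightforward to build, which also makes the scaling problem visible (every vector is as long as the vocabulary):

vocab = ["he", "wrote", "a", "book"]
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}
print(one_hot["wrote"])  # [0, 1, 0, 0]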
word2vec
The idea behind word2vec is that a word is strongly related to its context. A neural network is trained on this idea, and the resulting word vectors exhibit interesting regularities, such as the classic analogy king - man + woman ≈ queen.
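As a hedged sketch using the third-party gensim library (not part of the course code; this assumes the gensim 4.x API, where the embedding size parameter is vector_size):

from gensim.models import Word2Vec

sentences = [
    ["she", "saw", "the", "city"],
    ["harry", "walked", "to", "the", "city"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["city"])               # the learned 50-dimensional vector
print(model.wv.most_similar("city"))  # nearest words in vector space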
Project
Parser
import nltk


def judge(word):
    """Return True if `word` contains at least one alphabetic character."""
    return any(c.isalpha() for c in word)
def preprocess(sentence):
    """
    Convert `sentence` to a list of its words.
    Pre-process sentence by converting all characters to lowercase
    and removing any word that does not contain at least one alphabetic
    character.
    """
    words = nltk.wordpunct_tokenize(sentence)
    res = []
    for word in words:
        word_lower = word.lower()
        if judge(word_lower):
            res.append(word_lower)
    return res
def np_chunk_helper(tree, chunks):
    if tree is None:
        return
    # Collect all NP subtrees (note: subtrees() includes `tree` itself).
    np_subtrees = [t for t in tree.subtrees() if t.label() == "NP"]
    if len(np_subtrees) == 1:
        # Exactly one NP here, so it contains no nested NP: it is a chunk.
        # Guard against duplicates.
        if np_subtrees[0] not in chunks:
            chunks.append(np_subtrees[0])
    elif len(np_subtrees) > 1:
        # More than one NP: recurse into proper subtrees, excluding `tree`.
        for t in tree.subtrees():
            if t != tree:
                np_chunk_helper(t, chunks)
def np_chunk(tree):
    """
    Return a list of all noun phrase chunks in the sentence tree.
    A noun phrase chunk is defined as any subtree of the sentence
    whose label is "NP" that does not itself contain any other
    noun phrases as subtrees.
    """
    chunks = []
    np_chunk_helper(tree, chunks)
    return chunks
Questions
import functools
import os
import string

import nltk
import numpy as np


def load_files(directory):
    """
    Given a directory name, return a dictionary mapping the filename of each
    `.txt` file inside that directory to the file's contents as a string.
    """
    res = dict()
    for filename in os.listdir(directory):
        # Only load `.txt` files, per the docstring.
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(directory, filename)
        contents = ""
        with open(path, "r", encoding="UTF-8") as file:
            for line in file.readlines():
                contents += line.strip() + " "
        res[filename] = contents
    return res
def tokenize(document):
    """
    Given a document (represented as a string), return a list of all of the
    words in that document, in order.
    Process document by converting all words to lowercase, and removing any
    punctuation or English stopwords.
    """
    # Load the stopword list once, rather than on every iteration.
    stopwords = set(nltk.corpus.stopwords.words("english"))
    words = [t.lower() for t in nltk.word_tokenize(document)]
    return [
        word for word in words
        if word not in string.punctuation and word not in stopwords
    ]
def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of documents to a list
    of words, return a dictionary that maps words to their IDF values.
    Any word that appears in at least one of the documents should be in the
    resulting dictionary.
    """
    res = dict()
    n = len(documents)
    # Convert each word list to a set for fast membership tests.
    doc_sets = dict()
    all_words = set()
    for filename in documents:
        doc_sets[filename] = set(documents[filename])
        all_words.update(doc_sets[filename])
    for word in all_words:
        # idf(word) = log(#documents / #documents containing word)
        cnt = sum(1 for words in doc_sets.values() if word in words)
        res[word] = np.log(n / cnt)
    return res
def compute_tf(word, words):
    """Return the term frequency: the number of times `word` appears in `words`."""
    return words.count(word)
def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` (a dictionary mapping names of
    files to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the filenames of the `n` top
    files that match the query, ranked according to tf-idf.
    """
    scores = []
    for filename in files:
        # Sum tf-idf over the query words that appear in the corpus.
        score = 0
        for word in query:
            if word in idfs:
                score += compute_tf(word, files[filename]) * idfs[word]
        scores.append((filename, score))
    scores.sort(key=lambda x: -x[1])
    return [filename for filename, _ in scores[:n]]
def term_density(word, sentence_words):
    """Return the proportion of words in `sentence_words` equal to `word`."""
    return sentence_words.count(word) / len(sentence_words)
# Comparator: sort descending by idf score, breaking ties by
# query term density (also descending).
def cmp(a, b):
    if a[1] != b[1]:
        return b[1] - a[1]
    return b[2] - a[2]
def top_sentences(query, sentences, idfs, n):
    """
    Given a `query` (a set of words), `sentences` (a dictionary mapping
    sentences to a list of their words), and `idfs` (a dictionary mapping words
    to their IDF values), return a list of the `n` top sentences that match
    the query, ranked according to idf. If there are ties, preference should
    be given to sentences that have a higher query term density.
    """
    scored = []
    for sentence in sentences:
        idf_score = 0
        density = 0
        for word in query:
            if word in sentences[sentence]:
                idf_score += idfs[word]
                density += term_density(word, sentences[sentence])
        scored.append((sentence, idf_score, density))
    scored.sort(key=functools.cmp_to_key(cmp))
    return [sentence for sentence, _, _ in scored[:n]]